Most current machine learning works
well because of human-designed
representations and input features

Machine learning becomes just optimizing
weights to best make a final prediction


Representation learning attempts to 
automatically learn good features or representations 


Deep learning algorithms attempt to learn multiple levels of 
representation of increasing complexity/abstraction 


A Deep Architecture 


Mainly, work has explored deep belief networks (DBNs), Markov 
Random Fields with multiple layers, and various types of 
multiple-layer neural networks 


Output layer


Here predicting a supervised target 


Hidden layers 


These learn more abstract 
representations as you head up 


Input layer 


Raw sensory inputs (roughly)


Part 1.1: The Basics 


Five Reasons to Explore 
Deep Learning 


#1 Learning representations


Handcrafting features is time-consuming 

The features are often both over-specified and incomplete 

The work has to be done again for each task/domain/... 

We must move beyond handcrafted features and simple ML 

Humans develop representations for learning and reasoning 
Our computers should do the same 


Deep learning provides a way of doing this 
#2 The need for distributed
representations


Current NLP systems are incredibly fragile because of
their atomic symbol representations


[Figure: parse tree of "My dog also eats oranges" with atomic category labels (NP, ADVP, VP, PRP$, NN, RB, VBZ, NNS)]


#2 The need for distributed
representations


Learned word representations that model similarities 
help enormously in NLP 


Distributional similarity based word clusters greatly help most 
applications 


+1.4% F1 Dependency Parsing 15.2% error reduction (Koo & 
Collins 2008, Brown clustering) 


+3.4% F1 Named Entity Recognition 23.7% error reduction 
(Stanford NER, exchange clustering) 


#2 The need for distributed
representations


[Figure: LOCAL PARTITION (clustering into regions defined by prototypes) versus DISTRIBUTED PARTITION (several sub-partitions that jointly define many regions)]


Learning features that are not mutually exclusive can be exponentially 
more efficient than nearest-neighbor-like or clustering-like models 


Distributed representations deal with 
the curse of dimensionality 


Generalizing locally (e.g., nearest 
neighbors) requires representative 
examples for all relevant variations! 


1 dimension: 10 positions
2 dimensions: 100 positions
3 dimensions: 1000 positions!


Classic solutions:
* Manual feature design
* Assuming a smooth target function (e.g., linear models)
* Kernel methods (linear in terms of kernel based on data points)


Neural networks parameterize and 
learn a "similarity" kernel 


#3 Unsupervised feature and
weight learning


Today, most practical, good NLP & ML methods require
labeled training data (i.e., supervised learning)


But almost all data is unlabeled


Most information must be acquired unsupervised


Fortunately, a good model of observed data can really help you 
learn classification decisions 


#4 Learning multiple levels of
representation


Biologically inspired learning


* The cortex seems to have a generic learning
algorithm


* The brain has a deep architecture


[Figure: shared lower layers feeding Task 1 Output, Task 2 Output, Task 3 Output]


We need good intermediate representations 
that can be shared across tasks 


Multiple levels of latent variables allow 
combinatorial sharing of statistical strength 


* Insufficient model depth can be
exponentially inefficient

[Lee et al. ICML 2009; Lee et al. NIPS 2009]
Successive model layers learn deeper intermediate representations


[Figure: features learned at successive layers, up to Layer 3; the top corresponds to high-level linguistic representations]


Handling the recursivity of human
language


Human sentences are composed
from words and phrases


We need compositionality in our
ML models


Recursion: the same operator
(same parameters) is applied
repeatedly on different
components


[Figure: tree over the phrases "A small crowd", "quietly enters", "historic church", with a semantic representation vector at each node]
#5 Why now?


Despite prior investigation and understanding of many of the
algorithmic techniques ...


Before 2006, training deep architectures was unsuccessful


What has changed?


* New methods for unsupervised pre-training have been
developed (Restricted Boltzmann Machines = RBMs,
autoencoders, contrastive estimation, etc.)
* More efficient parameter estimation methods
* Better understanding of model regularization


Deep Learning models have already
achieved impressive results for HLT


Neural Language Model
[Mikolov et al. Interspeech 2011]

Model \ WSJ task              Eval WER
KN5 Baseline                  17.2
Discriminative LM             16.9
Recurrent NN combination      14.4


MSR MAVIS Speech System
[Dahl et al. 2012; Seide et al. 2011;
following Mohamed et al. 2011]

"The algorithms represent the first time a
company has released a deep-neural-networks
(DNN)-based speech-recognition
algorithm in a commercial product."

Acoustic model & training            Recognition      WER (two test sets; names illegible)
GMM 40-mix, BMMI, SWB 309h           1-pass -adapt    27.4                23.6
CD-DNN 7 layer x 2048, SWB 309h      1-pass -adapt    [illegible] (33%)   [illegible] (32%)
GMM 72-mix, BMMI, FSH 2000h          k-pass +adapt    18.6                17.1
Deep Learning Models Have Interesting
Performance Characteristics


Deep learning models can now be very fast in some circumstances
* SENNA [Collobert et al. 2011] can do POS or NER faster than
other SOTA taggers (16x to 122x), using 25x less memory
* WSJ POS 97.29% acc; CoNLL NER 89.59% F1; CoNLL Chunking 94.32% F1

Changes in computing technology favor deep learning
* In NLP, speed has traditionally come from exploiting sparsity
* But with modern machines, branches and widely spaced
memory accesses are costly
* Uniform parallel operations on dense vectors are faster


These trends are even stronger with multi-core CPUs and GPUs
"Good work -- but I think
we might need a little
more detail right here."
Outline of the Tutorial


1. The Basics
   1. Motivations
   2. From logistic regression to neural networks
   3. Word representations
   4. Unsupervised word vector learning
   5. Backpropagation Training
   6. Learning word-level classifiers: POS and NER
   7. Sharing statistical strength
2. Recursive Neural Networks
3. Applications, Discussion, and Resources
Outline of the Tutorial


1. The Basics
2. Recursive Neural Networks
   1. Motivation
   2. Recursive Neural Networks for Parsing
   3. Theory: Backpropagation Through Structure
   4. Recursive Autoencoders
   5. Application to Sentiment Analysis and Paraphrase Detection
   6. Compositionality Through Recursive Matrix-Vector Spaces
   7. Relation classification
3. Applications, Discussion, and Resources
Outline of the Tutorial


1. The Basics
2. Recursive Neural Networks
3. Applications, Discussion, and Resources
   1. Applications
      1. Neural language models
      2. Structured embedding of knowledge bases
      3. Assorted other speech and NLP applications
   2. Resources (readings, code, ...)
   3. Tricks of the trade
   4. Discussion: Limitations, advantages, future directions


Part 1.2: The Basics 


From logistic regression to
neural nets


Demystifying neural networks 


Neural networks come with 
their own terminological 
baggage 


… just like SVMs 


But if you understand how 
maxent/logistic regression 
models work 

Then you already understand the 


operation of a basic neural 
network neuron! 
A single neuron
A computational unit with n (3) inputs
and 1 output
and parameters W, b


[Figure: inputs, activation function, output]


Bias unit corresponds to intercept term


From Maxent Classifiers to Neural 
Networks 


In NLP, a maxent classifier is normally written as:

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c, d)}{\sum_{c' \in C} \exp \sum_i \lambda_i f_i(c', d)}$$

Supervised learning gives us a distribution for datum d over classes in C

Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$

Such a classifier is used as-is in a neural network ("a softmax layer")
* Often as the top layer


But for now we'll derive a two-class logistic model for one neuron
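From Maxent Classifiers to Neural
Networks


Vector form:
$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$

Make two class:
$$P(c_1 \mid d, \lambda) = \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}} = \frac{e^{\lambda^T f(c_1, d)}}{e^{\lambda^T f(c_1, d)} + e^{\lambda^T f(c_2, d)}} \cdot \frac{e^{-\lambda^T f(c_1, d)}}{e^{-\lambda^T f(c_1, d)}}$$
$$= \frac{1}{1 + e^{\lambda^T [f(c_2, d) - f(c_1, d)]}} = \frac{1}{1 + e^{-\lambda^T x}} \quad \text{for } x = f(c_1, d) - f(c_2, d)$$
$$= f(\lambda^T x)$$

for f(z) = 1/(1 + exp(-z)), the logistic function, a sigmoid non-linearity.

[Plot: the logistic function]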
This is exactly what a neuron
computes


$$h_{w,b}(x) = f(w^T x + b), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

b: We can have an "always on"
feature, which gives a class prior,
or separate it out, as a bias term


[Figure: inputs x1, x2, x3 and a +1 bias unit feeding one neuron with output h_{w,b}(x)]


w, b are the parameters of this neuron
i.e., this logistic regression model
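
The equivalence between one neuron and a two-class logistic regression can be made concrete with a minimal sketch. The weights, bias and input below are hypothetical values chosen only for illustration; nothing here comes from the tutorial's own code.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, b, x):
    """Output h_{w,b}(x) = f(w^T x + b) of a single neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return logistic(z)

# Hypothetical parameters and a 3-dimensional input.
w = [0.5, -1.0, 2.0]
b = 0.1
x = [1.0, 2.0, 0.5]

h = neuron(w, b, x)  # here z = 0.5 - 2.0 + 1.0 + 0.1 = -0.4
```

Reading `h` as P(c_1 | x) gives exactly the two-class model derived from the maxent classifier.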


A neural network = running several
logistic regressions at the same time


If we feed a vector of inputs through a bunch of logistic regression 
functions, then we get a vector of outputs 


But we don’t have to decide 
ahead of time what variables 
these logistic regressions are 
trying to predict! 


A neural network = running several
logistic regressions at the same time


... which we can feed into another logistic regression function 


and it is the training 
criterion that will 
decide what those 
intermediate binary 
target variables should 
be, so as to do a good 
job of predicting the 
targets for the next 
layer, etc. 


A neural network = running several
logistic regressions at the same time


Before we know it, we have a multilayer neural network... 


Matrix notation for a Layer 


We have
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
$a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$
etc.

In matrix notation

$z = Wx + b$
$a = f(z)$

where f is applied element-wise:

$f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$
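
As a small illustration of the matrix form z = Wx + b, a = f(z), the following sketch computes one layer with plain Python lists. The 2x3 weight matrix and the inputs are made-up values; a real implementation would normally use a linear algebra library.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x, f=logistic):
    """Compute a = f(Wx + b), with f applied element-wise."""
    z = [sum(W_ij * x_j for W_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [f(z_i) for z_i in z]

# Hypothetical layer with 3 inputs and 2 units.
W = [[0.1, 0.2, 0.3],
     [0.0, -1.0, 1.0]]
b = [0.0, 0.5]
x = [1.0, 1.0, 1.0]

a = layer_forward(W, b, x)  # pre-activations are z = [0.6, 0.5]
```

Each row of `W` together with one entry of `b` is one logistic regression; the layer simply runs them side by side on the same input.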


How do we train the weights W? 


* For a supervised single layer neural net, we can train the model
just like a maxent model: we calculate and use gradients
  * Stochastic gradient descent (SGD)
  * Conjugate gradient or L-BFGS

* A multilayer net could be more complex because the internal
("hidden") logistic units make the function non-convex ... just as
for hidden CRFs [Quattoni et al. 2005, Gunawardana et al. 2005]
  * But we can use the same ideas and techniques
  * Just without guarantees ...

* This leads into "backpropagation", which we cover later
Non-Linearities: Why they're needed


* For logistic regression: map to probabilities

* Here: function approximation,
e.g., regression or classification

* Without non-linearities, deep neural networks
can't do anything more than a linear transform
* Extra layers could just be compiled down into
a single linear transform
* Probabilistic interpretation unnecessary except in
the Boltzmann machine/graphical models

* People often use other non-linearities, such as
tanh, as we'll discuss in part 3
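
The claim that extra layers without a non-linearity compile down into a single linear transform can be checked directly. This sketch uses two made-up weight matrices and omits biases for brevity (a bias term folds into the combined map in the same way).

```python
def matvec(W, x):
    """Matrix-vector product Wx for a list-of-rows matrix."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product AB for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights: W1 maps 2 -> 3 dimensions, W2 maps 3 -> 2.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))   # two stacked linear layers
one_layer = matvec(matmul(W2, W1), x)    # one layer with weights W2 W1
```

Both computations give the same vector, so the second layer adds no expressive power until a non-linearity is placed between the two.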


Summary
Knowing the meaning of words!


You now understand the basics and the relation to other models
Neuron = logistic regression or similar function

Input layer = input training/test vector 

Bias unit = intercept term/always on feature 

Activation = response 

Activation function is a logistic (or similar “sigmoid” nonlinearity) 


Backpropagation = running stochastic gradient descent across a 
multilayer network 


Weight decay = regularization / Bayesian prior 


Effective deep Learning became possible 
through unsupervised pre-training 


[Erhan et al., JMLR 2010] 


(with RBMs and Denoising Auto-Encoders) 


[Figure: test classification error (%) versus number of layers, for a purely supervised neural net and with unsupervised pre-training]


0-9 handwritten digit recognition error rate (MNIST data)


Part 1.3: The Basics 


Word Representations 


The standard word representation 


The vast majority of rule-based and statistical NLP work regards 
words as atomic symbols: hotel, conference, walk 


In vector space terms, this is a vector with one 1 and a lot of zeroes

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech), 50K (PTB), 500K (big vocab), 13M (Google 1T)
We call this a "one-hot" representation. Its problem:


motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
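
The problem with one-hot vectors can be shown in a few lines: any two different words have zero overlap, so the representation carries no notion of similarity. The vocabulary size of 15 and the two indices are illustrative assumptions matching the vectors drawn above.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at position `index`."""
    v = [0] * size
    v[index] = 1
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical word indices in a 15-word vocabulary.
vocab = {"hotel": 7, "motel": 10}
hotel = one_hot(vocab["hotel"], 15)
motel = one_hot(vocab["motel"], 15)

similarity = dot(hotel, motel)       # different words never share a non-zero position
self_similarity = dot(hotel, hotel)  # a word only matches itself
```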
Distributional similarity based
representations


You can get a lot of value by representing a word by 
means of its neighbors 


“You shall know a word by the company it keeps” 
(J. R. Firth 1957: 11) 


One of the most successful ideas of modern statistical NLP 


[Figure: two example contexts of the word "banking"; the surrounding words will represent banking]


You can vary whether you use local or large context
to get a more syntactic or semantic clustering


Class-based (hard) and soft 
clustering word representations 


Class based models learn word classes of similar words based on 
distributional information ( ~ class HMM) 


* Brown clustering (Brown et al. 1992)
* Exchange clustering (Martin et al. 1998, Clark 2003)
* Desparsification and great example of unsupervised pre-training


Soft clustering models learn for each cluster/topic a distribution
over words of how likely that word is in each cluster


* Latent Semantic Analysis (LSA/LSI), Random projections
* Latent Dirichlet Allocation (LDA), HMM clustering


Neural word embeddings 
as a distributed representation 


Similar idea 


Combine vector space
semantics with the prediction of
probabilistic models (Bengio et
al. 2003, Collobert & Weston
2008, Turian et al. 2010)


In all of these approaches,
including deep learning models,
a word is represented as a
dense vector

linguistics = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]


Neural word embeddings - 
visualization 


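[Figure: 2-D projection of word embeddings in which related verbs lie close together: need, help, come, go, take, give, keep, make, get, meet, continue, expect, want, become, think, say, remain, are, is, were, was, being, been, had, has, have]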


Advantages of the neural word 
embedding approach 


Compared to a method like LSA, neural word embeddings 
can become more meaningful through adding supervision 
from one or multiple tasks 


For instance, sentiment is usually not captured in unsupervised 
word embeddings but can be in neural word vectors 


We can build representations for large linguistic units 


See part 2 
Part 1.4: The Basics


Unsupervised word vector
learning


A neural network for learning word
vectors (Collobert et al. JMLR 2011)


Idea: A word and its context is a positive training
sample; a random word in that same context gives
a negative training sample:


positive: cat chills on a mat        negative: cat chills Jeju a mat


Similar: Implicit negative evidence in Contrastive 
Estimation, (Smith and Eisner 2005) 


A neural network for learning word 
vectors 


How do we formalize this idea? Ask that 


score(cat chills on a mat) > score(cat chills Jeju a mat) 


How do we compute the score? 


* With a neural network


* Each word is associated with an
n-dimensional vector
Word embedding matrix


Initialize all word vectors randomly to form a word embedding
matrix $L \in \mathbb{R}^{n \times |V|}$


[Figure: L drawn as an n x |V| matrix with one column per vocabulary word: the, cat, mat, ...]


These are the word features we want to learn


Also called a look-up table


* Conceptually you get a word's vector by left multiplying a
one-hot vector e by L: x = Le
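
The statement x = Le can be illustrated with a tiny sketch showing that multiplying L by a one-hot vector is the same as reading out one column, which is why L is called a look-up table. The three-word vocabulary and the values in `L` are made up for the example.

```python
def one_hot(index, size):
    v = [0] * size
    v[index] = 1
    return v

def matvec(L, e):
    """Matrix-vector product Le for a list-of-rows matrix."""
    return [sum(l * e_j for l, e_j in zip(row, e)) for row in L]

def lookup(L, index):
    """Read column `index` of L directly, without any multiplication."""
    return [row[index] for row in L]

vocab = ["the", "cat", "mat"]
# Hypothetical embedding matrix with n = 2 rows and |V| = 3 columns.
L = [[0.1, 0.4, -0.3],
     [0.7, -0.2, 0.5]]

cat = vocab.index("cat")
x_mult = matvec(L, one_hot(cat, len(vocab)))  # x = Le
x_lookup = lookup(L, cat)                     # same vector via indexing
```

In practice only the indexing version is used, since the product touches every entry of L to select a single column.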
Word vectors as input to a neural
network


* score(cat chills on a mat)

* To describe a phrase, retrieve (via index) the corresponding
vectors from L

[Figure: the five column vectors for "cat chills on a mat"]

* Then concatenate them to a 5n vector:
$x = [x_{cat} \; x_{chills} \; x_{on} \; x_{a} \; x_{mat}]$

* How do we then compute score(x)?
A Single Layer Neural Network


* A single layer is a combination of a linear layer
and a nonlinearity:
$z = Wx + b$
$a = f(z)$

* The neural activations can then
be used to compute some function.

* For instance, the score we care about:
$\mathrm{score}(x) = U^T a \in \mathbb{R}$


Summary: Feed-forward Computation


Computing a window's score with a 3-layer Neural
Net: s = score(cat chills on a mat)


$s = U^T f(Wx + b)$, with $x \in \mathbb{R}^{20 \times 1}$, $W \in \mathbb{R}^{8 \times 20}$, $U \in \mathbb{R}^{8 \times 1}$

$s = U^T a$
$a = f(z)$
$z = Wx + b$
$x = [x_{cat} \; x_{chills} \; x_{on} \; x_{a} \; x_{mat}]$
$L \in \mathbb{R}^{n \times |V|}$
Summary: Feed-forward Computation


* $s$ = score(cat chills on a mat)
* $s_c$ = score(cat chills Jeju a mat)
* Idea for training objective: make score of true window
larger and corrupt window's score lower (until they're
good enough): minimize

$J = \max(0, 1 - s + s_c)$

* This is continuous, can perform SGD
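
A minimal sketch of the window score and the ranking objective follows. The logistic non-linearity and all parameter values are assumptions for illustration; the loss itself is the J = max(0, 1 - s + s_c) stated above.

```python
import math

def score(U, W, b, x):
    """Window score s = U^T f(Wx + b), here with f = logistic."""
    a = [1.0 / (1.0 + math.exp(-(sum(w * x_j for w, x_j in zip(row, x)) + b_i)))
         for row, b_i in zip(W, b)]
    return sum(u * a_i for u, a_i in zip(U, a))

def hinge_loss(s, s_c):
    """J = max(0, 1 - s + s_c): zero once the true window wins by a margin of 1."""
    return max(0.0, 1.0 - s + s_c)

# Toy check of the score: with zero weights every activation is f(0) = 0.5.
s_toy = score([1.0, 1.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.0, 2.0])

loss_satisfied = hinge_loss(2.0, 0.5)  # margin already met, no gradient
loss_violated = hinge_loss(0.3, 0.1)   # margin violated, loss is 0.8
```

Windows whose loss is already zero contribute nothing to the update, so SGD steps only on windows where the corrupt score is still too close to the true one.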


Training with Backpropagation


$J = \max(0, 1 - s + s_c)$
$s = U^T f(Wx + b)$
$s_c = U^T f(Wx_c + b)$

Assuming cost J is > 0, it is simple to see that we
can compute the derivatives of $s$ and $s_c$ wrt all the
involved variables: U, W, b, x
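Training with Backpropagation


* Let's consider the derivative of a single weight $W_{ij}$

$$\frac{\partial s}{\partial W} = \frac{\partial}{\partial W} U^T a = \frac{\partial}{\partial W} U^T f(z) = \frac{\partial}{\partial W} U^T f(Wx + b)$$

* This only appears inside $a_i$
* For example: $W_{23}$ is only
used to compute $a_2$


Derivative of weight $W_{ij}$:

$$\frac{\partial}{\partial W_{ij}} U^T a = \frac{\partial}{\partial W_{ij}} U_i a_i = U_i \frac{\partial a_i}{\partial W_{ij}} = U_i \frac{\partial a_i}{\partial z_i} \frac{\partial z_i}{\partial W_{ij}} = U_i f'(z_i) \frac{\partial z_i}{\partial W_{ij}} = U_i f'(z_i) \frac{\partial (W_{i\cdot} x + b_i)}{\partial W_{ij}}$$

with $z_i = W_{i\cdot} x + b_i = \sum_{j=1}^{3} W_{ij} x_j + b_i$ and $a_i = f(z_i)$.


Derivative of single weight $W_{ij}$:

$$U_i f'(z_i) \frac{\partial (W_{i\cdot} x + b_i)}{\partial W_{ij}} = U_i f'(z_i) \frac{\partial}{\partial W_{ij}} \sum_k W_{ik} x_k = \underbrace{U_i f'(z_i)}_{\delta_i} \, x_j = \delta_i x_j$$

$\delta_i$ is the local error signal, $x_j$ is the local input signal.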
Training with Backpropagation


* From single weight $W_{ij}$ to full W:

$$\frac{\partial s}{\partial W_{ij}} = \underbrace{U_i f'(z_i)}_{\delta_i} \, x_j = \delta_i x_j, \qquad z_i = W_{i\cdot} x + b_i, \quad a_i = f(z_i)$$

* We want all combinations of
i = 1, 2 and j = 1, 2, 3

* Solution: Outer product: $\frac{\partial s}{\partial W} = \delta x^T$
where $\delta \in \mathbb{R}^{2 \times 1}$ is the
"responsibility" coming from
each activation a


Training with Backpropagation


* For biases b, we get:

$$\frac{\partial}{\partial b_i} U_i a_i = U_i f'(z_i) \frac{\partial (W_{i\cdot} x + b_i)}{\partial b_i} = \delta_i$$


Training with Backpropagation 


That's almost backpropagation 


It's simply taking derivatives and using the chain rule! 
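
One way to convince yourself that the chain-rule result is right is a numerical gradient check. The sketch below assumes a logistic non-linearity and made-up parameters with 2 hidden units and 3 inputs; it computes the analytic gradients of the score s = U^T f(Wx + b) as an outer product of the error signal with the input, and compares one entry against a central finite difference.

```python
import math

def f(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_prime(z):
    return f(z) * (1.0 - f(z))

def pre_activations(W, b, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def score(U, W, b, x):
    return sum(u * f(z_i) for u, z_i in zip(U, pre_activations(W, b, x)))

def score_gradients(U, W, b, x):
    """Return (ds/dW, ds/db) using the error signal delta_i = U_i f'(z_i)."""
    z = pre_activations(W, b, x)
    delta = [u * f_prime(z_i) for u, z_i in zip(U, z)]
    dW = [[d * x_j for x_j in x] for d in delta]  # outer product delta x^T
    db = list(delta)                              # bias gradient is delta itself
    return dW, db

# Hypothetical parameters.
U = [0.5, -1.5]
W = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
b = [0.05, -0.1]
x = [1.0, 2.0, -1.0]

dW, db = score_gradients(U, W, b, x)

# Central finite difference for the single weight W[1][2].
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[1][2] += eps
W_minus[1][2] -= eps
numeric = (score(U, W_plus, b, x) - score(U, W_minus, b, x)) / (2 * eps)
```

If `numeric` and `dW[1][2]` disagree beyond rounding error, the analytic derivation (or its implementation) has a bug; this kind of check is cheap enough to run on every new model.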


Remaining trick: we can re-use derivatives computed for 
higher layers in computing derivatives for lower layers 


Example: last derivatives of model, the word vectors in x 
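Training with Backpropagation


* Take derivative of score with
respect to single word vector
(for simplicity a 1d vector,
but same if it was longer)

* Now, we cannot just take
into consideration one $a_i$
because each $x_j$ is connected
to all the neurons above and
hence $x_j$ influences the
overall score through all of
these, hence:

$$\frac{\partial s}{\partial x_j} = \sum_{i=1}^{2} \frac{\partial s}{\partial a_i} \frac{\partial a_i}{\partial x_j} = \sum_{i=1}^{2} \frac{\partial U^T a}{\partial a_i} \frac{\partial a_i}{\partial x_j} = \sum_{i=1}^{2} U_i \frac{\partial f(W_{i\cdot} x + b)}{\partial x_j}$$
$$= \sum_{i=1}^{2} \underbrace{U_i f'(W_{i\cdot} x + b)}_{\delta_i} \frac{\partial W_{i\cdot} x}{\partial x_j} = \sum_{i=1}^{2} \delta_i W_{ij} = \delta^T W_{\cdot j}$$

Re-used part of previous derivative: $\delta_i$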


Training with Backpropagation: 
softmax 
What is the major benefit of learned word vectors? 


Ability to also propagate labeled information into them, 
via softmax/maxent and hidden layer: 


$$P(c \mid d, \lambda) = \frac{e^{\lambda^T f(c, d)}}{\sum_{c'} e^{\lambda^T f(c', d)}}$$


Part 1.5: The Basics 


Backpropagation Training 


Back-Prop 


* Compute gradient of example-wise loss wrt
parameters

* Simply applying the derivative chain rule wisely
$z = f(y)$, $y = g(x)$, $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$

* If computing the loss(example, parameters) is O(n)
computation, then so is computing the gradient


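Simple Chain Rule

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$


Multiple Paths Chain Rule

$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1} \frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2} \frac{\partial y_2}{\partial x}$$


Multiple Paths Chain Rule - General

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}$$


Chain Rule in Flow Graph


Flow graph: any directed acyclic graph
node = computation result
arc = computation dependency

$\{y_1, y_2, \ldots, y_n\}$ = successors of $x$

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}$$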
Back-Prop in Multi-Layer Net
$NLL = -\log P(Y = y \mid x)$


Back-Prop in General Flow Graph


Single scalar output z


1. Fprop: visit nodes in topo-sort order
   - Compute value of node given predecessors
2. Bprop:
   - initialize output gradient = 1
   - visit nodes in reverse order:
     Compute gradient wrt each node using
     gradient wrt successors

$\{y_1, y_2, \ldots, y_n\}$ = successors of $x$

$$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x}$$
Automatic Differentiation


* The gradient computation can
be automatically inferred from
the symbolic expression of the
fprop.

* Each node type needs to know
how to compute its output and
how to compute the gradient
wrt its inputs given the
gradient wrt its output.

* Easy and fast prototyping
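
The fprop/bprop recipe on a flow graph can be sketched in a few dozen lines. This is a toy reverse-mode differentiator written for illustration, not the API of any particular library: the `Node` class and the `add`, `mul`, `tanh` and `backprop` helpers are all hypothetical names. Each node stores its value and, for every input, the local derivative of its output with respect to that input.

```python
import math

class Node:
    """One node of a flow graph: a value plus local gradients to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (input_node, d(self)/d(input_node))
        self.grad = 0.0          # will hold d(output)/d(self) after backprop

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def tanh(a):
    t = math.tanh(a.value)
    return Node(t, ((a, 1.0 - t * t),))

def backprop(output):
    """Fprop already ran while the graph was built; this runs bprop."""
    order, seen = [], set()
    def visit(node):                      # depth-first topological sort
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0                     # initialize output gradient = 1
    for node in reversed(order):          # successors are handled before inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local   # sum over all successors

# x reaches z along two paths, so its gradient is a sum of two terms.
x = Node(0.5)
w = Node(-1.2)
y1 = mul(w, x)
y2 = tanh(x)
z = add(y1, y2)
backprop(z)
```

After the call, `x.grad` holds w + (1 - tanh(x)^2), the multiple-paths chain rule applied to the two successors of x, and `w.grad` holds x.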
Part 1.6: The Basics


Learning word-level classifiers:
POS and NER


The Model


(Collobert & Weston 2008;
Collobert et al. 2011)

* Similar to word vector
learning but replaces the
single scalar score with a
Softmax/Maxent classifier

* Training is again done via
backpropagation which gives
an error similar to the score
in the unsupervised word
vector learning model
The Model - Training


* We already know the softmax classifier and how to optimize it

* The interesting twist in deep learning is that the input features
are also learned, similar to learning word vectors with a score:

[Figure: the window network with weights $U_2$ and $W_{23}$ highlighted]
The Model - Training


* All derivatives of layers beneath the score were multiplied by U;
for the softmax, the error vector becomes the difference
between predicted and gold standard distributions

[Figure: the window network with weights $U_2$ and $W_{23}$ highlighted]
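
The statement that the softmax error vector is the difference between the predicted and gold distributions can be verified numerically. In this sketch the three class scores and the gold class are made-up values, and the loss is the negative log-likelihood of the gold class.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(scores, gold):
    """Negative log-likelihood of the gold class under the softmax."""
    return -math.log(softmax(scores)[gold])

def error_vector(scores, gold):
    """Gradient of the NLL wrt the scores: predicted minus gold distribution."""
    p = softmax(scores)
    return [p_c - (1.0 if c == gold else 0.0) for c, p_c in enumerate(p)]

scores = [2.0, -1.0, 0.5]   # hypothetical class scores
gold = 0
delta = error_vector(scores, gold)

# Central finite difference on the score of class 1.
eps = 1e-6
plus = scores[:]
minus = scores[:]
plus[1] += eps
minus[1] -= eps
numeric = (nll(plus, gold) - nll(minus, gold)) / (2 * eps)
```

This `delta` plays the role that U played for the scalar score: it is the error signal that gets propagated down into the hidden layer and the word vectors.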
The secret sauce is the unsupervised
pre-training on a large text collection


                                   POS WSJ (acc.)   NER CoNLL (F1)
State-of-the-art*                  97.24            89.31
Supervised NN                      96.37            81.47
Unsupervised pre-training
followed by supervised NN**        97.20            88.87
+ hand-crafted features***         97.29            89.59


* Representative systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang
2005)

** 130,000-word embedding trained on Wikipedia and Reuters with 11 word
window, 100 unit hidden layer (for 7 weeks!), then supervised task training

*** Features are character suffixes for POS and a gazetteer for NER


Supervised refinement of the
unsupervised word representation helps


                            POS WSJ (acc.)   NER CoNLL (F1)
Supervised NN               96.37            81.47
NN with Brown clusters      96.92            87.15
Fixed embeddings*           97.10            88.87
C&W 2011**                  97.29            89.59


* Same architecture as C&W 2011, but word embeddings are kept constant
during the supervised training phase

** C&W is unsupervised pre-train + supervised NN + features model of last slide
Multi-Task Learning

* Generalizing better to new
tasks is crucial to approach
AI

* Deep architectures learn
good intermediate
representations that can be
shared across tasks

* Good representations make
sense for many tasks


Combining Multiple Sources of 
Evidence with Shared Embeddings 


Relational learning 
Multiple sources of information / relations 
Some symbols (e.g. words, wikipedia entries) shared 


Shared embeddings help propagate information 
among data sources: e.g., WordNet, XWN, Wikipedia, 
FreeBase, ... 


Part 1.7 


Sharing statistical strength 


Sharing Statistical Strength 


* Besides very fast prediction, the main advantage of
deep learning is statistical

* Potential to learn from fewer labeled examples because
of sharing of statistical strength:
  * Unsupervised pre-training & Multi-task learning
  * Semi-supervised learning


Semi-Supervised Learning


* Hypothesis: P(c|x) can be more accurately computed using
shared structure with P(x)

[Figure: decision boundary learned purely supervised versus semi-supervised]
Deep autoencoders


* Alternative to contrastive unsupervised word learning
  * Another is RBMs (Hinton et al. 2006), which we don't cover today

* Works well for fixed input representations (word vectors are not
but bag of word representations are)


1. Definition, intuition and variants of autoencoders
2. Stacking for deep autoencoders
3. Why do autoencoders improve deep neural nets so much?
Auto-Encoders


* Multilayer neural net with target output = input
* Reconstruction = decoder(encoder(input))

$a = \tanh(Wx + b)$
$x' = \tanh(W^T a + c)$
$\mathrm{cost} = \lVert x' - x \rVert^2$

* Probable inputs have
small reconstruction error

[Figure: input layer, encoder, code layer, decoder, reconstruction layer]
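
A forward pass through a one-hidden-layer autoencoder can be sketched as follows. The tied decoder weights (the transpose of the encoder matrix), the squared-error cost and all numeric values are assumptions made for this illustration.

```python
import math

def affine_tanh(W, b, x):
    """Compute tanh(Wx + b) for a list-of-rows matrix W."""
    return [math.tanh(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

def transpose(W):
    return [list(col) for col in zip(*W)]

def autoencoder_cost(W, b, c, x):
    """Return (code, reconstruction, squared reconstruction error)."""
    a = affine_tanh(W, b, x)                   # encoder: a = tanh(Wx + b)
    x_rec = affine_tanh(transpose(W), c, a)    # decoder with tied weights W^T
    cost = sum((r - x_i) ** 2 for r, x_i in zip(x_rec, x))
    return a, x_rec, cost

# Hypothetical 3-dimensional input compressed to a 2-dimensional code.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.9, -0.4]]
b = [0.0, 0.1]
c = [0.0, 0.0, 0.0]
x = [0.2, -0.5, 0.7]

a, x_rec, cost = autoencoder_cost(W, b, c, x)
```

Training would adjust W, b and c by backpropagation to reduce `cost` over many inputs; inputs that resemble the training data then end up with small reconstruction error.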
PCA = Linear Manifold = Linear Auto-Encoder


input x, 0-mean

features = code = h(x) = W x

reconstruction(x) = $W^T h(x) = W^T W x$

W = principal eigen-basis of Cov(X)


[Figure: data points projected onto a linear manifold]


LSA example:
x = (normalized) distribution
of co-occurrence frequencies
The Manifold Learning Hypothesis


* Examples concentrate near a lower dimensional
"manifold" (region of high density where small changes are only
allowed in certain directions)


Auto-Encoders Learn Salient
Variations, like a non-linear PCA


Minimizing reconstruction error
forces latent representation of
"similar inputs" to stay on
manifold
Auto-Encoder Variants

* Discrete inputs: cross-entropy or log-likelihood reconstruction
criterion (similar to that used for discrete targets for MLPs)

* Preventing them from learning the identity everywhere:
  * Undercomplete (e.g., PCA): bottleneck code smaller than input
  * Sparsity: penalize hidden unit activations so they are at or near 0
  [Goodfellow et al 2009]
  * Denoising: predict true input from corrupted input
  [Vincent et al 2008]
  * Contractive: force encoder to have small derivatives
  [Rifai et al 2011]


Sparse autoencoder illustration for 
images 


Natural Images 


Learned bases: 


Test example


[a_1, ..., a_64] = [0, 0, ..., 0, 0.8, 0, ..., 0, 0.3, 0, ..., 0, 0.5, 0]
(feature representation)


Stacking Auto-Encoders


* Can be stacked successfully (Bengio et al NIPS 2006) to form highly
non-linear representations

[Figure: autoencoders stacked on input x, with encoder weights W1, W2 and decoder weights W1', W2']
Layer-wise Unsupervised Pre-training


[Figure sequence, building the network one layer at a time:]
1. input, then features computed from the input
2. features trained so that a reconstruction of the input matches the input
3. more abstract features computed from the features
4. more abstract features trained so that a reconstruction of the features matches the features
5. even more abstract features computed from the more abstract features


Supervised Fine-Tuning


[Figure: the full stack (input, features, more abstract features, even more abstract features) with an output f(X) on top, compared against the target Y]

Why is unsupervised pre-training
working so well?


Regularization hypothesis:
* Representations good for P(x)
are good for P(y|x)


Optimization hypothesis:
* Unsupervised initializations start
near better local minimum of
supervised training error
* Minima otherwise not
achievable by random
initialization


Erhan, Courville, Manzagol,
Vincent, Bengio (JMLR, 2010)
Part 2
Recursive Neural Networks


Building on Word Vector Space Models


[Figure: 2-D word vector space with axes x1 and x2, containing the points Germany, France, Monday, Tuesday and the two phrases below]


the country of my birth
the place where I was born


But how can we represent the meaning of longer phrases?


By mapping them into the same vector space!


How should we map phrases into a
vector space?


The meaning (vector) of a sentence
is determined by
(1) the meanings of its words and
(2) the rules that combine them.


[Figure: the same vector space, with "the country of my birth" and "the place where I was born" placed near Germany and France, away from Monday and Tuesday]


Recursive Neural Nets
can jointly learn
compositional vector
representations and
parse trees


Recursive Neural Networks

1. Motivation
2. Recursive Neural Networks for Parsing
3. Theory: Backpropagation Through Structure
4. Recursive Autoencoders
5. Application to Sentiment Analysis and Paraphrase Detection
6. Compositionality Through Recursive Matrix-Vector Spaces
7. Relation classification


Sentence Parsing: What we want


Learn Structure and Representation

Recursive Neural Networks for 
Structure Prediction 
Inputs: two candidate children’s representations 


Outputs: 
1. The semantic representation if the two nodes are merged. 


2. Score of how plausible the new node would be. 
Recursive Neural Network Definition


[Figure: a neural network takes the two children's vectors and outputs the parent vector together with a score (here 1.3)]

Same W parameters at all nodes
of the tree
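
A single composition step of a recursive neural network can be sketched as below. The concrete form is an assumption for illustration: the parent is p = tanh(W [c1; c2] + b), matching the forward-prop formula used later for backpropagation through structure, and the plausibility score is taken to be a linear function U^T p of the parent. All numbers are made up.

```python
import math

def compose(W, b, U, c1, c2):
    """Merge two n-dimensional children into (parent vector, score)."""
    children = c1 + c2                     # concatenation [c1; c2], length 2n
    p = [math.tanh(sum(w * c for w, c in zip(row, children)) + b_i)
         for row, b_i in zip(W, b)]
    s = sum(u * p_i for u, p_i in zip(U, p))   # assumed score = U^T p
    return p, s

# Hypothetical parameters for n = 2: W is n x 2n, b and U have length n.
W = [[0.3, -0.1, 0.5, 0.2],
     [-0.4, 0.6, 0.1, -0.3]]
b = [0.0, 0.1]
U = [1.0, -0.5]

the, cat, sat = [0.2, 0.7], [0.9, -0.1], [-0.3, 0.4]
np_vec, np_score = compose(W, b, U, the, cat)        # merge "the" + "cat"
s_vec, s_score = compose(W, b, U, np_vec, sat)       # reuse the same W higher up
```

Because the parent has the same dimensionality as each child, the output of one merge can be fed straight back in as a child of the next, which is what allows the same W to be applied at every node of the tree.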
Related Work to Socher et al. (ICML
2011)

Pollack (1990): Recursive auto-associative memories

Previous Recursive Neural Networks work by Goller & Küchler
(1996), Costa et al. (2003) assumed fixed tree structure and
used one hot vectors.

Hinton (1990) and Bottou (2011): Related ideas about
recursive models and recursive operators as smooth
versions of logic operations


Parsing a sentence with an RNN 


30 «0.0.0 . 


Neural Neural Neural Neural Neural 
Network Network Network Network Network 


6 D) D) 0 0 C 


The cat 
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Parsing a sentence 
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Parsing a sentence 




Parsing a sentence 
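
The parsing slides can be read as a greedy bottom-up procedure: score every pair of adjacent nodes, merge the best pair, and repeat until one node is left. The sketch below assumes that reading. `greedy_parse`, the scoring vector `u` and the toy parameters are hypothetical, and the original system may search differently.

```python
import math

def compose(W, b, c1, c2):
    """Parent vector tanh(W [c1; c2] + b)."""
    x = c1 + c2
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def greedy_parse(words, vectors, W, b, u):
    """Repeatedly merge the adjacent pair whose parent scores highest."""
    nodes = list(zip(words, vectors))      # (subtree, vector) pairs
    total = 0.0
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes) - 1):
            p = compose(W, b, nodes[i][1], nodes[i + 1][1])
            s = sum(ui * pi for ui, pi in zip(u, p))
            if best is None or s > best[0]:
                best = (s, i, p)
        s, i, p = best
        nodes[i:i + 2] = [((nodes[i][0], nodes[i + 1][0]), p)]
        total += s                         # tree score = sum of node scores
    return nodes[0][0], total

# Arbitrary toy parameters (n = 2) and word vectors.
W = [[0.5, -0.2, 0.1, 0.3],
     [0.0, 0.4, -0.1, 0.2]]
b = [0.0, 0.1]
u = [1.0, -1.0]
tree, tree_score = greedy_parse(
    ["The", "cat", "sat"],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], W, b, u)
```

The result is a nested tuple of words (a binary bracketing) plus the summed score of all merge decisions.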


Max-Margin Framework — Details 


* The score of a tree is computed by the sum of the parsing decision scores at each node.
* Similar to max-margin parsing (Taskar et al. 2004), a supervised max-margin objective

$$ J = \sum_i \left[ s(x_i, y_i) - \max_{y \in A(x_i)} \left( s(x_i, y) + \Delta(y, y_i) \right) \right] $$

* The loss $\Delta(y, y_i)$ penalizes all incorrect decisions
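
As a small illustration of the objective for a single sentence, assuming the candidate set is given as explicit (score, loss) pairs and contains the gold tree with loss 0. The function name and numbers are hypothetical.

```python
def max_margin_objective(gold_score, candidates):
    """s(x, y_gold) minus the best loss-augmented candidate score.

    candidates: list of (score, loss) pairs, one per candidate tree.
    """
    return gold_score - max(s + delta for s, delta in candidates)

# Gold tree scores 3.0; a wrong tree with score 2.5 and loss 1.0 violates the margin.
objective = max_margin_objective(3.0, [(3.0, 0.0), (2.5, 1.0), (1.0, 2.0)])
```

The value is at most 0 and reaches 0 only when the gold tree beats every other candidate by at least that candidate's loss.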
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Backpropagation Through Structure 
(BTS) 


* Introduced by Goller & Küchler (1996)
* Principally the same as general backpropagation
* Two differences resulting from the tree structure:
  * Split derivatives at each node
  * Sum derivatives of W from all nodes
BTS: Split derivatives at each node


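* During forward prop, the parent is computed using 2 children:

$$ p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right) $$

* Hence, the errors need to be computed wrt each of them:

[Figure: the error arriving at the parent is passed down and split between the two children c_1 and c_2.]

where each child's error is n-dimensional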
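
A rough sketch of the split, assuming `delta_parent` is the error at the parent's pre-activation and W has shape n x 2n. The function name is hypothetical. For an internal child, the returned error would additionally be multiplied by the derivative of that child's own non-linearity.

```python
def split_child_errors(W, delta_parent, n):
    """Propagate the parent's error through W and split it between the children.

    W^T delta is 2n-dimensional: the first n entries belong to c1, the last n to c2.
    """
    down = [sum(W[i][j] * delta_parent[i] for i in range(len(W)))
            for j in range(2 * n)]
    return down[:n], down[n:]

W = [[0.5, -0.2, 0.1, 0.3],
     [0.0, 0.4, -0.1, 0.2]]
err_c1, err_c2 = split_child_errors(W, [1.0, 2.0], n=2)
```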
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BTS: Sum derivatives of all nodes 


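* You can actually assume it's a different W at each node
* Intuition via example:

$$
\begin{aligned}
\frac{\partial}{\partial W} f(W(f(Wx)))
&= f'(W(f(Wx))) \left( \left( \frac{\partial}{\partial W} W \right) f(Wx) + W \frac{\partial}{\partial W} f(Wx) \right) \\
&= f'(W(f(Wx))) \left( f(Wx) + W f'(Wx)\, x \right)
\end{aligned}
$$

* If we take separate derivatives of each occurrence, we get the same:

$$
\begin{aligned}
&\frac{\partial}{\partial W_2} f(W_2(f(W_1 x))) + \frac{\partial}{\partial W_1} f(W_2(f(W_1 x))) \\
&= f'(W_2(f(W_1 x)))\, f(W_1 x) + f'(W_2(f(W_1 x)))\, W_2 f'(W_1 x)\, x \\
&= f'(W_2(f(W_1 x))) \left( f(W_1 x) + W_2 f'(W_1 x)\, x \right) \\
&= f'(W(f(Wx))) \left( f(Wx) + W f'(Wx)\, x \right)
\end{aligned}
$$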
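
The identity can be checked numerically in the scalar case. This sketch assumes f = tanh and arbitrary values for w and x; it compares the shared-weight derivative, the sum of the per-occurrence derivatives, and a finite-difference estimate.

```python
import math

def f(z):
    return math.tanh(z)

def df(z):
    return 1.0 - math.tanh(z) ** 2

def shared_grad(w, x):
    """d/dw f(w f(w x)) with the same w used at both nodes."""
    inner = f(w * x)
    return df(w * inner) * (inner + w * df(w * x) * x)

def separate_grads(w1, w2, x):
    """Derivatives of f(w2 f(w1 x)) wrt each occurrence, treated as distinct weights."""
    inner = f(w1 * x)
    d_w2 = df(w2 * inner) * inner
    d_w1 = df(w2 * inner) * w2 * df(w1 * x) * x
    return d_w1, d_w2

def numeric_grad(w, x, eps=1e-6):
    g = lambda v: f(v * f(v * x))
    return (g(w + eps) - g(w - eps)) / (2 * eps)

w, x = 0.7, 1.3
total_shared = shared_grad(w, x)
total_separate = sum(separate_grads(w, w, x))
total_numeric = numeric_grad(w, x)
```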


BTS: Optimization 


* As before, we can plug the gradients into a standard off-the-shelf L-BFGS optimizer
* For non-continuous objective use subgradient method (Ratliff et al. 2007)
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Labeling in Recursive Neural Networks 
* We can use each node's representation as features for a softmax classifier:

$$ p(c \mid p) = \mathrm{softmax}(S p) $$

[Figure: a softmax layer on top of a node of the network predicts its label, e.g. NP.]
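
A minimal sketch of the node classifier, assuming S is a (number of labels) x n weight matrix and p is the node's vector. The label set and numbers are made up for illustration.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def label_distribution(S, p):
    """Distribution over labels for a node with representation p: softmax(S p)."""
    return softmax([sum(s * v for s, v in zip(row, p)) for row in S])

labels = ["NP", "VP", "PP"]               # hypothetical label set
S = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
probs = label_distribution(S, [0.9, 0.1])
predicted = labels[probs.index(max(probs))]
```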
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Experiments: Parsing Short Sentences 


* Standard WSJ train/test
* Good results on short sentences
* More work is needed for longer sentences

Results (columns: L15 Dev, L15 Test):
Sigmoid NN (Titov & Henderson 2007) 89.3


All the figures are adjusted for seasonal variations 
1. All the numbers are adjusted for seasonal fluctuations 
2. All the figures are adjusted to remove usual seasonal patterns 


Knight-Ridder wouldn't comment on the offer 
1. Harsco declined to say what country placed the order 
2. Coastal wouldn't disclose the terms 


Sales grew almost 7% to $UNK m. from $UNK m.
1. Sales rose more than 7% to $94.9 m. from $88.3 m.
2. Sales surged 40% to UNK b. yen from UNK b.


Short Paraphrase Detection


* Goal is to say which of candidate phrases are a good paraphrase of a given phrase
* Motivated by Machine Translation
* Initial algorithms: Bannard & Callison-Burch 2005 (BC 2005), Callison-Burch 2008 (CB 2008) exploit bilingual sentence-aligned corpora and hand-built linguistic constraints
* We simply re-use our system learned on parsing the WSJ

[Figure: bar chart of F1 of Paraphrase Detection, with a y-axis from 0 to 0.5.]
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Paraphrase detection task, CCB data, 


Candidates with human goodness label (1-5) ordered by our system

the united states: the usa (5), the us (5), united states (5), north america (4), united (1), the (1), of the united states (3), america (5), nations (2), we (3)

around the world: around the globe (5), throughout the world (5), across the world (5), over the world (2), in the world (5), of the budget (2), of the world (5)

it would be: it would represent (5), there will be (2), that would be (3), it would be ideal (2), it would be appropriate (2), it is (3), it would (2)

of capital punishment: of the death penalty (5), to death (2), the death penalty (2), of (1)

in the long run: in the long term (5), in the short term (2), for the longer term (5), in the future (5), in the end (3), in the long-term (5), in time (5), of the (1)


Scene Parsing 


* The meaning of a scene image is also a function of smaller regions,
* how they combine as parts to form larger objects,
* and how the objects interact
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Algorithm for Parsing Images 


Same Recursive Neural Network as for natural language parsing! 
(Socher et al. ICML 2011) 


Parsing Natural Scene Images 


[Figure: image segments are mapped to features and then to semantic representations, which are merged into a parse tree with region labels such as Grass, People, Building and Tree.]
Method Accuracy (%)
Pixel CRF (Gould et al., ICCV 2009) 74.3 
Classifier on superpixel features 75.9 
Region-based energy (Gould et al., ICCV 2009) 76.4 
Local labelling (Tighe & Lazebnik, ECCV 2010) 76.9 
Superpixel MRF (Tighe & Lazebnik, ECCV 2010) 77.5 
Simultaneous MRF (Tighe & Lazebnik, ECCV 2010) 77.5 
Recursive Neural Network 78.1 


Stanford Background Dataset (Gould et al. 2009)




Recursive Autoencoders 


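* Similar to Recursive Neural Net but instead of a supervised score we compute a reconstruction error at each node:

$$ E_{rec}([c_1; c_2]) = \frac{1}{2} \left\| [c_1; c_2] - [c_1'; c_2'] \right\|^2 $$

[Figure: a tree in which each parent is computed from its two children as f(W[c_1; c_2] + b) and then decoded to reconstruct them.]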
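
A minimal sketch of the reconstruction error at one node. It assumes a tanh encoder and, as a simplification, a tanh decoder with its own weights; the decoder form and all parameter names are assumptions for illustration.

```python
import math

def affine_tanh(W, b, x):
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def reconstruction_error(W_enc, b_enc, W_dec, b_dec, c1, c2):
    """Encode [c1; c2] into a parent, decode it, and return (parent, 1/2 squared error)."""
    children = c1 + c2                                  # 2n-dimensional
    parent = affine_tanh(W_enc, b_enc, children)        # n-dimensional
    recon = affine_tanh(W_dec, b_dec, parent)           # 2n-dimensional
    err = 0.5 * sum((a - r) ** 2 for a, r in zip(children, recon))
    return parent, err

W_enc = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]]
b_enc = [0.0, 0.1]
W_dec = [[0.0, 0.0]] * 4          # an untrained decoder that always outputs zeros
b_dec = [0.0] * 4
parent, err = reconstruction_error(W_enc, b_enc, W_dec, b_dec, [1.0, 0.0], [0.0, 1.0])
```

With the all-zero decoder the error is simply half the squared norm of the children; training would lower it by adjusting both weight matrices.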
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Semi-supervised Recursive 
Autoencoder 


* To capture sentiment and solve antonym problem, add a softmax classifier
* Error is a weighted combination of reconstruction error and cross-entropy
* Socher et al. (EMNLP 2011)

[Figure: each node feeds both a reconstruction layer (reconstruction error) and a softmax label layer (cross-entropy error).]
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Sentiment Detection 


* Sentiment detection is crucial to business intelligence, stock trading, ...

[News clipping, 3/18/11 at 4:00 PM]
Mentions of the Name 'Anne Hathaway' May Drive Berkshire Hathaway Stock

By Patrick Huguenin

The Huffington Post recently pointed out that whenever Anne Hathaway is in the news, the stock price for Warren Buffett's Berkshire Hathaway goes up. Really. When Bride Wars opened, the stock rose 2.61 percent. (Rachel Getting Married only kicked it up 0.44 percent, but, you know, that one was so light on plot compared to Bride Wars.)

[Photo caption: Maybe she'll change her name to Halliburton. Just to see.]


Sentiment Detection and Bag-of-Words 
Models 


* Most methods start with a bag of words + linguistic features/processing/lexica
* But such methods (including tf-idf) can't distinguish:
  * white blood cells destroying an infection
  * an infection destroying white blood cells
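
The point can be checked directly: the two phrases contain the same words, so any representation built only from word counts is identical for both. A small sketch using Python's `collections.Counter` as the bag of words:

```python
from collections import Counter

a = "white blood cells destroying an infection"
b = "an infection destroying white blood cells"

bag_a = Counter(a.split())
bag_b = Counter(b.split())

same_bag = bag_a == bag_b               # word counts are identical
same_order = a.split() == b.split()     # word order is not
```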


Single Scale Experiments: Movies 


Stealing Harvard doesn't care about 
cleverness, wit or any other kind of 
intelligent humor. 


a film of ideas and wry comic 
mayhem. 


Accuracy of Positive/Negative 
Sentiment Classification 


* Results on movie reviews (MR) and opinions (MPQA).
* All other methods use hand-designed polarity shifting rules or sentiment lexica.
* RAE: no hand-designed features, learns vector representations

Method                            MR     MPQA
Phrase voting with lexicons       63.1   81.7
Bag of features with lexicons     76.4   84.1
Tree-CRF (Nakagawa et al. 2010)   77.3   86.1
RAE (this work)                   77.7   86.4
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Sorted Negative and Positive N-grams 


Most Negative N-grams (shortest first):
* bad; boring; dull; flat; pointless
* that bad; abysmally pathetic
* is more boring; manipulative and contrived
* boring than anything else.; a major waste ... generic
* loud, silly, stupid and pointless. ; dull, dumb and derivative horror film.

Most Positive N-grams (shortest first):
* touching; enjoyable; powerful
* the beautiful; with dazzling
* funny and touching; a small gem
* cute, funny, heartwarming; with wry humor and genuine
* , deeply absorbing piece that works as a; ... one of the most ingenious and entertaining;
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Sentiment Distribution Experiments 
* Learn distributions over multiple complex sentiments → New dataset and task
* Experience Project
  * http://www.experienceproject.com
  * “I walked into a parked car”
  * Sorry, Hugs; You rock; Tee-hee; I understand; Wow just wow
  * Over 31,000 entries with 113 words on average


Sentiment distributions
* Sorry, Hugs; You rock; Tee-hee; I understand; Wow just wow


Predicted and Gold Distribution | Anonymous Confession

[Figure: bar charts of the predicted and gold label distributions shown next to each confession.]

* well i think hairy women are attractive
* i am a very succesfull business man. i make good money but i have been addicted to crack for 13 years. i moved 1 hour away from my dealers 10 years ago to stop using now i dont use daily but ...
* Dear Love, I just want to say that I am looking for you. Tonight I felt the urge to write, and I am becoming more and more frustrated that I have not found you yet. I'm also tired of spending so much heart on an old dream. ...
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Experience Project most votes results 


Random 20
Most frequent class 38
Bag of words; MaxEnt classifier 46
Spellchecker, sentiment lexica, SVM 47
SVM on neural net word features 46
RAE (this work) 50


Average KL between gold and predicted label distributions:
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Avg.Distr. BoW Features Word Vec. RAE 


Paraphrase Detection 


* Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses
* Basically, the plaintiffs did not show that omissions in Merrill’s research caused the claimed losses

* The initial report was made to Modesto Police December 28
* It stems from a Modesto police report
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How to compare the 
meaning of two 
sentences? 


Recursive Autoencoders for Full
Sentence Paraphrase Detection


* Unsupervised Unfolding RAE and a pair-wise sentence comparison of nodes in parsed trees
* Socher et al. (NIPS 2011)


[Figure: a recursive autoencoder encodes each sentence ("The cats catch mice", "Cats eat mice"); comparing all node pairs fills a similarity matrix, which passes through a variable-sized pooling layer and a neural network for variable-sized input to give the pairwise classification output (paraphrase or not).]
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Recursive Autoencoders for Full 
Sentence Paraphrase Detection 


* Experiments on Microsoft Research Paraphrase Corpus (Dolan et al. 2004)

Method                                                Acc.   F1
Rus et al. (2008)                                     70.6   80.5
Mihalcea et al. (2006)                                70.3   81.3
Islam et al. (2007)                                   72.6   81.3
Qiu et al. (2006)                                     72.0   81.6
Fernando et al. (2008)                                74.1   82.4
Wan et al. (2006)                                     75.6   83.0
Das and Smith (2009)                                  73.9   82.3
Das and Smith (2009) + 18 Surface Features            76.1   82.7
F. Bu et al. (ACL 2012): String Re-writing Kernel     76.3   -
Unfolding Recursive Autoencoder (NIPS 2011)           76.8   83.6
Recursive Autoencoders for Full
Sentence Paraphrase Detection


Sentences


(1) LLEYTON Hewitt yesterday traded his tennis racquet for his first sporting passion - 
Australian football - as the world champion relaxed before his Wimbledon title defence 

(2) LLEYTON Hewitt yesterday traded his tennis racquet for his first sporting passion- 
Australian rules football-as the world champion relaxed ahead of his Wimbledon defence 


(1) The lies and deceptions from Saddam have been well documented over 12 years 
(2) It has been well documented over 12 years of lies and deception from Saddam 


(1) Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses
(2) Basically, the plaintiffs did not show that omissions in Merrill’s research caused the claimed losses


(1) Prof Sally Baldwin, 63, from York, fell into a cavity which opened up when the structure collapsed at Tiburtina station, Italian railway officials said
(2) Sally Baldwin, from York, was killed instantly when a walkway collapsed and she fell into the machinery at Tiburtina station


(1) Bremer, 61, is a onetime assistant to former Secretaries of State William P. Rogers and 
Henry Kissinger and was ambassador-at-large for counterterrorism from 1986 to 1989 
(2) Bremer, 61, is a former assistant to former Secretaries of State William P. Rogers and 


(1) The initial report was made to Modesto Police December 28 
(2) It stems from a Modesto police report 




Compositionality Through Recursive
Matrix-Vector Spaces


$$ p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right) $$


* But what if words act mostly as an operator, e.g. “very” in very good
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Compositionality Through Recursive 
Matrix-Vector Recursive Neural Networks 


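$$ p = \tanh\left( W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b \right) \quad \longrightarrow \quad p = \tanh\left( W \begin{bmatrix} C_2 c_1 \\ C_1 c_2 \end{bmatrix} + b \right) $$

where each child i now has both a vector $c_i$ and a matrix $C_i$.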
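
A minimal sketch of the matrix-vector composition, assuming each child carries a vector c_i and a matrix C_i and that each child's matrix is applied to the other child's vector before the usual tanh layer. The function names and toy numbers are illustrative, and the sketch omits how the parent's own matrix would be computed.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mv_compose(W, b, c1, C1, c2, C2):
    """p = tanh(W [C2 c1; C1 c2] + b): each word's matrix modifies its neighbour's vector."""
    x = matvec(C2, c1) + matvec(C1, c2)
    return [math.tanh(s + bias) for s, bias in zip(matvec(W, x), b)]

W = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.1, 0.2]]
b = [0.0, 0.1]
identity = [[1.0, 0.0], [0.0, 1.0]]
amplify = [[2.0, 0.0], [0.0, 2.0]]      # a toy "very"-like operator matrix

very, good = [0.1, 0.0], [0.0, 1.0]
plain = mv_compose(W, b, very, identity, good, identity)   # reduces to the plain RNN
scaled = mv_compose(W, b, very, amplify, good, identity)   # "very" rescales "good"
```

With identity matrices the model falls back to the vector-only composition, so the matrices can be seen as a strict extension of it.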
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Recursive Matrix-Vector Model 


[Figure: each node of the tree carries both a vector and a matrix.]
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Predicting Sentiment Distributions 
[Figure: nine panels comparing the sentiment distributions predicted by MV-RNN and RNN with the ground truth, for the phrases "fairly annoying", "not annoying", "unbelievably annoying", "fairly awesome", "not awesome", "unbelievably awesome", "fairly sad", "not sad" and "unbelievably sad".]

MV-RNN for Relationship Classification 


[Figure: an MV-RNN tree over "the [movie] showed [wars] …" with a classifier on top predicting Message-Topic.]

More info at EMNLP talk on July 14th


Relationship | Sentence with labeled nouns for which to predict relationships
Cause-Effect(e2,e1) | Avian [influenza]e1 is an infectious disease caused by type a strains of the influenza [virus]e2.
Entity-Origin(e1,e2) | The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1) | Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.


Classifier | Feature Sets | F1
SVM | POS, stemming, syntactic patterns | 60.1
SVM | word pair, words in between | 72.5
SVM | POS, WordNet, stemming, syntactic patterns | 74.8
SVM | POS, WordNet, morphological features, thesauri, Google n-grams | 77.6
MaxEnt | POS, WordNet, morphological features, noun compound system, thesauri, Google n-grams | 77.6
SVM | POS, WordNet, prefixes and other morphological features, POS, dependency parse features, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-grams, paraphrases, TextRunner | 82.2
RNN | - | 74.8
Lin.MVR | - | 73.0
MV-RNN | - | 79.1
RNN | POS, WordNet, NER | 77.6
Lin.MVR | POS, WordNet, NER | 78.7
MV-RNN | POS, WordNet, NER | 82.4


Summary: Recursive Deep Learning 


* Recursive Deep Learning can predict hierarchical structure and classify the structured output using compositional vectors
* State-of-the-art performance on
  * Sentiment Analysis on multiple corpora
  * Paraphrase detection on the MSRP dataset
  * Relation Classification on SemEval 2010, Task 8
  * Vision modality (Stanford background dataset)
* Code on www.socher.org

[Figure: thumbnails of the recursive autoencoder, the paraphrase model with similarity matrix and variable-sized pooling layer, and the recursive matrix-vector model.]
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Part 3 


1. Applications
   1. Neural language models
   2. Structured embedding of knowledge bases
   3. Assorted other speech and NLP applications
2. Resources (readings, code, ...)
3. Tricks of the trade
4. Discussion: Limitations, advantages, future directions
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Existing NLP Applications 


* Language Modeling
  * Speech Recognition
  * Machine Translation
* Part-Of-Speech Tagging
* Chunking
* Named Entity Recognition
* Semantic Role Labeling
* Sentiment Analysis
* Paraphrasing
* Question-Answering
* Word-Sense Disambiguation


Part 3.1: Applications 


Neural Language Models 


Language Modeling 


* Predict P(next word | previous word)
* Gives a probability for a longer sequence
* Applications to Speech, Translation and Compression
* Computational bottleneck: large vocabulary V means that computing the output costs #hidden units × |V|.
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Neural Language Model 


* Bengio et al NIPS'2000 and JMLR 2003 "A Neural Probabilistic Language Model"
* Each word represented by a distributed continuous-valued code
* Generalizes to sequences of words that are semantically similar to training sequences

[Figure: network architecture. The context words index a table look-up in C (shared parameters across words); the output layer is a normalized exponential whose i-th output = P(w_t = i | context), and most computation happens there.]
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Recurrent Neural Net Language 


Modeling for ASR 


* [Mikolov et al 2011] Bigger is better... experiments on Broadcast News NIST-RT04: perplexity goes from 140 to 102
* Paper shows how to train a recurrent neural net with a single core in a few days, with > 1% absolute improvement in WER
* Code: http://www.fit.vutbr.cz/~imikolov/rnnlm/
[Figure: WER on eval [%] versus hidden layer size for RNN, RNN+KN4, KN4, RNNME and RNNME+KN4, next to a diagram of the recurrent network predicting P(w_t | context) from the previous words.]


Language Modeling Output Bottleneck 


* [Schwenk et al 2002]: only predict most frequent words (short list) and use n-gram for the others
* [Morin & Bengio 2005; Blitzer et al 2005; Mnih & Hinton 2007, 2009; Mikolov et al 2011]: hierarchical representations, multiple output groups, conditionally computed, predict
  * P(word category | context)
  * P(sub-category | context, category)
  * P(word | context, sub-category, category)
* Hard categories, can be arbitrary [Mikolov et al 2011]

[Figure: a tree with categories at the top and the words within each category at the leaves.]
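
A rough sketch of the two-level case, assuming each word belongs to exactly one hard category, so that P(word | context) = P(category | context) · P(word | context, category). The scores stand in for network outputs and are made up; the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def word_probability(class_scores, word_scores_by_class, cls, word_idx):
    """P(word | context) factored through the word's category.

    Only the category scores and the scores of the words inside that one
    category are normalized, not the scores of the whole vocabulary.
    """
    p_class = softmax(class_scores)[cls]
    p_word = softmax(word_scores_by_class[cls])[word_idx]
    return p_class * p_word

class_scores = [1.0, 0.2]                          # 2 categories
word_scores_by_class = [[0.3, 0.1, -0.5],          # 3 words in category 0
                        [2.0, 0.0]]                # 2 words in category 1
total = sum(word_probability(class_scores, word_scores_by_class, c, w)
            for c, ws in enumerate(word_scores_by_class)
            for w in range(len(ws)))
```

The factored probabilities still sum to one over the vocabulary, while one prediction touches only the category outputs plus the words of a single category.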


Neural Net Language Modeling for ASR 


* [Schwenk 2007], real-time ASR, perplexity AND word error rate improve (CTS evaluation set 2003), perplexities go from 50.1 to 45.5

[Figure: Eval03 word error rate versus in-domain LM training corpus size, for backoff LM and hybrid LM trained on CTS data and on CTS+BN data.]


Application to Statistical Machine Translation


* Schwenk (NAACL 2012 workshop on the future of LM)
  * 41M words, Arabic/English bitexts + 151M English from LDC
* Perplexity down from 71.1 (6 Gig back-off) to 56.9 (neural model, 500M memory)
* +1.8 BLEU score (50.75 to 52.28)
* Can take advantage of longer contexts
* Code: http://lium.univ-lemans.fr/cslm/


Part 3.1: Applications 


Structured embedding of 
knowledge bases 


Modeling Semantics


* Learning Structured Embeddings of Knowledge Bases, (Bordes, Weston, Collobert & Bengio, AAAI 2011)
* Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing, (Bordes, Glorot, Weston & Bengio, AISTATS 2012)

[Figure: the model assigns a score to a triplet such as (door_1, has_part, lock_2).]
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Modeling Relations with Matrices 


* Model (lhs, relation, rhs)
* Each concept = 1 embedding vector
* Each relation = 2 matrices. Matrix acts like an operator.
* Ranking criterion
* Energy = low for training examples, high otherwise

[Figure: the relation chooses two matrices, which are applied to the lhs and rhs vectors to compute the energy.]
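
One way to make this concrete, as a sketch under assumptions: the energy is taken to be the L1 distance between the lhs vector transformed by the relation's first matrix and the rhs vector transformed by its second matrix, and the ranking criterion is a margin loss against a corrupted triplet. The exact norm, margin and all names here are assumptions, not read from the slide.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def energy(R_lhs, R_rhs, e_lhs, e_rhs):
    """L1 distance between the two transformed entity vectors (low = plausible)."""
    left = matvec(R_lhs, e_lhs)
    right = matvec(R_rhs, e_rhs)
    return sum(abs(l - r) for l, r in zip(left, right))

def ranking_loss(energy_true, energy_corrupt, margin=1.0):
    """Zero once the true triplet's energy is lower than the corrupted one's by the margin."""
    return max(0.0, margin + energy_true - energy_corrupt)

identity = [[1.0, 0.0], [0.0, 1.0]]
door, lock, cloud = [1.0, 0.5], [1.0, 0.5], [-2.0, 3.0]   # toy embeddings

e_true = energy(identity, identity, door, lock)
e_corrupt = energy(identity, identity, door, cloud)
loss = ranking_loss(e_true, e_corrupt)
```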


Question Answering: implicitly
adding new relations to WN or FB


Each entry is given as Model (All) / TextRunner.

lhs: army; rel: attacked
top ranked rhs: _troop_NN_4 / Israel; _armed_service_NN_1 / the village; _ship_NN_1 / another army; _territory_NN_1 / the city; _military_unit_NN_1 / the fort

top ranked lhs: _business_firm_NN_1 / People; _person_NN_1 / Players; _family_NN_1 / one; _payoff_NN_3 / Students; _card_game_NN_1 / business
rel: _earn_VB_1 / earn
rhs: _money_NN_1 / money


* MRs inferred from text define triplets between WordNet synsets.
* Model captures knowledge about relations between nouns and verbs.
* → Implicit addition of new relations to WordNet!
* → Generalize Freebase!


Embedding Nearest Neighbors of 
Words & Senses 


_mark_NN | _mark_NN_1 | _mark_NN_2
_indication_NN | _score_NN_1 | _marking_NN_1
_print_NN_3 | _number_NN_2 | _symbolizing_NN_1
_print_NN | _gradation_NN | _naming_NN_1
_roll_NN | _evaluation_NN_1 | _marking_NN
_pointer_NN | _tier_NN_1 | _punctuation_NN_3

_take_VB | _canary_NN | _different_JJ_1
_bring_VB | _sea_mew_NN_1 | _eccentric_NN
_put_VB | _yellowbird_NN_2 | _dissimilar_JJ
_ask_VB | _canary_bird_NN_1 | _same_JJ_2
_hold_VB | _larus_marinus_NN_1 | _similarity_NN_1
_provide_VB | _mew_NN | _common_JJ_1
Word Sense Disambiguation


* Senseval-3 results (only sentences with Subject-Verb-Object structure)
* XWN results (XWN = eXtended WN)

MFS = most frequent sense
All = training from all sources
Gamble = Decadt et al 2004 (Senseval-3 SOA)

[Figure: bar charts comparing Random, MFS, Gamble, WN, WN+Text, All and All+MFS.]
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Part 3.1: Applications 


Assorted Speech and NLP 
Applications 


Learning Multiple Word Vectors 


* Tackles problems with polysemous words
* Can be done with both standard tf-idf based methods [Reisinger and Mooney, NAACL 2010]
* Recent neural word vector model by [Huang et al. ACL 2012] learns multiple prototypes using both local and global context
* State of the art correlations with human similarity judgments

[Figure: model architecture combining Local Context and Global Context.]


Learning Multiple Word Vectors 


* Visualization of learned word vectors from Huang et al. (ACL 2012)

[Figure: 2-D visualization of the learned word vectors; legible words include translation, fantasy, stars, laundering, transaction, finance, banking, constellation, oracle, flash, asteroid, galaxy, moon, planet, direction, boundary, gap and territory.]


Phoneme-Level Acoustic Models 


* [Mohamed et al, 2011, IEEE Tr.ASLP]
  * Unsupervised pre-training as Deep Belief Nets (a stack of RBMs), supervised fine-tuning to predict phonemes
* Phoneme classification on TIMIT:
  * CD-HMM: 27.3% error
  * CRFs: 26.6%
  * Triphone HMMs w. BMMI: 22.7%
  * Unsupervised DBNs: 24.5%
  * Fine-tuned DBNs: 20.7%
* Improved version by Dong Yu is RELEASED IN MICROSOFT'S ASR system for Audio Video Indexing Service
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Domain Adaptation for 
Sentiment Analysis 


* [Glorot et al, ICML 2011] beats SOTA on Amazon benchmark, 25 domains
* Embeddings pre-trained in denoising auto-encoder
* Disentangling effect (features specialize to domain or sentiment)

[Figure: transfer ratio for Baseline, SCL, MCT, SFA, T-SVM, SDA and SDAsh.]
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Part 3.2: Resources 


Resources: Tutorials and Code 


Related Tutorials 
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See “Neural Net Language Models” Scholarpedia entry 


Deep Learning tutorials: http://deeplearning.net/tutorials

Stanford deep learning tutorials with simple programming assignments and reading list
http://deeplearning.stanford.edu/wiki/

Recursive Autoencoder class project
http://cseweb.ucsd.edu/~elkan/250B/learningmeaning.pdf

Graduate Summer School: Deep Learning, Feature Learning
http://www.ipam.ucla.edu/programs/gss2012

ICML 2012 Representation Learning tutorial
http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html

Paper references in separate pdf


Software 


* Theano (Python CPU/GPU) mathematical and deep learning library http://deeplearning.net/software/theano
  * Can do automatic, symbolic differentiation
* Senna: POS, Chunking, NER, SRL
  * by Collobert et al. http://ronan.collobert.com/senna/
  * State-of-the-art performance on many tasks
  * 3500 lines of C, extremely fast and using very little memory
* Recurrent Neural Network Language Model http://www.fit.vutbr.cz/~imikolov/rnnlm
* Recursive Neural Net and RAE models for paraphrase detection, sentiment analysis, relation classification www.socher.org
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Software: what's next 


* Off-the-shelf SVM packages are useful to researchers from a wide variety of fields (no need to understand RKHS).
* One of the goals of deep learning: Build off-the-shelf NLP classification packages that take as input only raw text, possibly with a label.
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Part 3.3: Deep Learning Tricks 


Deep Learning Tricks 


Deep Learning Tricks of the Trade 


* Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”
  * Unsupervised pre-training
  * Stochastic gradient descent and setting learning rates
  * Main hyper-parameters
    * Learning rate schedule & Early stopping
    * Minibatches
    * Parameter initialization
    * Number of hidden units
    * L1 or L2 weight decay
    * Sparsity regularization
  * Debugging → Finite difference gradient check (Yay)
  * How to efficiently search for hyper-parameter configurations
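
The finite difference gradient check from the list above can be sketched as follows: perturb one parameter at a time and compare the centered difference with the analytic gradient. The helper name, step size and toy objective are illustrative choices.

```python
def numerical_gradient(f, theta, eps=1e-5):
    """Centered finite-difference estimate of the gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = theta[:]
        up[i] += eps
        down = theta[:]
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

# Toy objective with a known gradient: f(theta) = sum(theta_i^2), so df/dtheta_i = 2 theta_i.
def objective(theta):
    return sum(t * t for t in theta)

theta = [0.5, -1.5, 2.0]
analytic = [2 * t for t in theta]
numeric = numerical_gradient(objective, theta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

A large `max_diff` points to a bug in the hand-derived gradient (or in the objective), which is why the check is worth running before any long training job.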
Non-linearities: What's used


logistic (“sigmoid”): $f(z) = \dfrac{1}{1 + e^{-z}}$

tanh: $f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

[Figure: plots of the logistic function and of the tanh function.]

tanh is just a rescaled and shifted sigmoid (2× as steep, [-1,1]):
$$ \tanh(z) = 2\,\mathrm{logistic}(2z) - 1 $$

tanh is what is most used and often performs best for deep nets
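
The relation between tanh and the logistic function is easy to confirm numerically. A small check over a grid of points (the grid itself is an arbitrary choice):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

grid = [k / 10.0 for k in range(-50, 51)]
max_gap = max(abs(math.tanh(z) - (2.0 * logistic(2.0 * z) - 1.0)) for z in grid)
```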
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Non-Linearities: There are various 
other choices 


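hard tanh:
$$ \mathrm{HardTanh}(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1 \end{cases} $$

soft sign:
$$ \mathrm{softsign}(z) = \frac{z}{1 + |z|} $$

rectifier:
$$ \mathrm{rect}(z) = \max(z, 0) $$

[Figure: plots of hard tanh, softsign (compared with tanh) and the rectifier.]

* hard tanh similar but computationally cheaper than tanh and saturates hard.
* [Glorot and Bengio AISTATS 2010, 2011] discuss softsign and rectifier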
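
For reference, the three alternatives written as plain scalar functions (a sketch; the function names are chosen here for illustration):

```python
def hard_tanh(x):
    """Identity on [-1, 1], clamped to -1 or 1 outside."""
    return max(-1.0, min(1.0, x))

def softsign(z):
    """z / (1 + |z|): like tanh in shape but approaches its asymptotes more slowly."""
    return z / (1.0 + abs(z))

def rect(z):
    """Rectifier: zero for negative inputs, identity for positive ones."""
    return max(z, 0.0)

samples = [-2.5, -0.5, 0.0, 0.5, 2.5]
table = [(z, hard_tanh(z), softsign(z), rect(z)) for z in samples]
```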
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Stochastic Gradient Descent (SGD) 
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Gradient descent uses total gradient over all examples per 
update, SGD updates after only 1 or few examples: 


$$ \theta^{(t)} \leftarrow \theta^{(t-1)} - \epsilon_t \frac{\partial L(z_t, \theta)}{\partial \theta} $$


L = loss function, $z_t$ = current example, $\theta$ = parameter vector, and $\epsilon_t$ = learning rate.


Ordinary gradient descent is a batch method, very slow, should never be used. Use 2nd order batch method such as LBFGS. On large datasets, SGD usually wins over all batch methods. On smaller datasets LBFGS or Conjugate Gradients win. Large-batch LBFGS extends the reach of LBFGS [Le et al ICML'2011].


Learning Rates 


* Simplest recipe: keep it fixed and use the same for all parameters.
* Collobert scales them by the inverse of square root of the fan-in of each neuron
* Better results can generally be obtained by allowing learning rates to decrease, typically in O(1/t) because of theoretical convergence guarantees, e.g.,

$$ \epsilon_t = \frac{\epsilon_0 \tau}{\max(t, \tau)} $$

with hyper-parameters $\epsilon_0$ and $\tau$.
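
A minimal sketch of SGD with this schedule: the rate stays at ε_0 for the first τ updates and then decays as 1/t. The scalar toy problem and the function names are assumptions made for illustration.

```python
def learning_rate(t, eps0, tau):
    """eps_t = eps0 * tau / max(t, tau): constant until t = tau, then O(1/t)."""
    return eps0 * tau / max(t, tau)

def sgd(grad, theta, examples, eps0=0.1, tau=100):
    """One parameter update per example, using the decaying learning rate."""
    for t, z in enumerate(examples, start=1):
        theta = theta - learning_rate(t, eps0, tau) * grad(z, theta)
    return theta

# Toy loss L(z, theta) = 0.5 * (theta - z)^2, whose gradient is theta - z.
theta_hat = sgd(lambda z, theta: theta - z, theta=0.0, examples=[3.0] * 2000)
```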
Long-Term Dependencies
and Clipping Trick
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In very deep networks such as recurrent networks (or possibly 
recursive ones), the gradient is a product of Jacobian matrices, 
each associated with a step in the forward computation. This 
can become very small or very large quickly [Bengio et al 1994],
and the locality assumption of gradient descent breaks down.


$$ L = L(s_T(s_{T-1}(\ldots s_{t+1}(s_t, \ldots)))) $$
$$ \frac{\partial L}{\partial s_t} = \frac{\partial L}{\partial s_T} \frac{\partial s_T}{\partial s_{T-1}} \cdots \frac{\partial s_{t+1}}{\partial s_t} $$


The solution first introduced by Mikolov is to clip gradients
to a maximum value. Makes a big difference in RNNs.
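
A sketch of the trick, assuming clipping is applied element-wise so that every gradient component lies in [-max_value, max_value]. The threshold is a hypothetical hyper-parameter.

```python
def clip_gradient(grad, max_value):
    """Clamp each gradient component to the interval [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]

clipped = clip_gradient([0.5, -7.0, 12.0], max_value=5.0)
```

Small components pass through unchanged, so the update direction is only altered when a gradient explodes.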


Parameter Initialization 


* Initialize hidden layer biases to 0 and output (or reconstruction) biases to optimal value if weights were 0 (e.g. mean target or inverse sigmoid of mean target).
* Initialize weights ~ Uniform(-r, r), r inversely proportional to fan-in (previous layer size) and fan-out (next layer size):

$$ r = \sqrt{\frac{6}{\text{fan-in} + \text{fan-out}}} $$

for tanh units, and 4x bigger for sigmoid units [Glorot AISTATS 2010]


Note: for embedding weights, fan-in=1 and we don't care about fan-out, Collobert uses Uniform(-1,1).
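
A sketch of this initialization, assuming the range r = sqrt(6 / (fan-in + fan-out)) from Glorot & Bengio (2010) for tanh units and four times that for sigmoid units. Function names and the fixed seed are illustrative.

```python
import math
import random

def init_radius(fan_in, fan_out, unit="tanh"):
    """Half-width r of the Uniform(-r, r) range for one weight matrix."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return 4.0 * r if unit == "sigmoid" else r

def init_weights(fan_in, fan_out, unit="tanh", seed=0):
    """fan_out x fan_in matrix with entries drawn from Uniform(-r, r)."""
    rng = random.Random(seed)
    r = init_radius(fan_in, fan_out, unit)
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]

W = init_weights(fan_in=100, fan_out=50)
r_tanh = init_radius(100, 50)
r_sigmoid = init_radius(100, 50, unit="sigmoid")
```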
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Part 3.4: Discussion 


Discussion: Limitations, 
Advantages, Future Directions 
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Concerns 
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Many algorithms and variants (burgeoning field) 


* Hyper-parameters (layer size, regularization, possibly learning rate)
  * Use multi-core machines, clusters and random sampling for cross-validation (Bergstra & Bengio 2012)
  * Pretty common for powerful methods, e.g. BM25
  * Can use (mini-batch) L-BFGS instead of SGD


Concerns 


* Not always obvious how to combine with existing NLP
  * Simple: Add word or phrase vectors as features. Gets close to state of the art for NER, [Turian et al, ACL 2010]
  * Integrate with known structures: Recursive and recurrent networks for trees and chains
  * Your research here
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Concerns 


* Slower to train than linear models
  * Only by a small constant factor, and much more compact than non-parametric (e.g. n-gram models)
  * Very fast during inference/test time (feed-forward pass is just a few matrix multiplies)
* Need more training data
  * Can handle and benefit from more training data, suitable for age of Big Data (Google trains neural nets with a billion connections, [Le et al, ICML 2012])
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Concerns 


* There aren’t many good ways to encode prior knowledge about the structure of language into deep learning models
  * There is some truth to this. However:
  * You can choose architectures suitable for a problem domain, as we did for linguistic structure
  * You can include human-designed features in the first layer, just like for a linear model
  * And the goal is to get the machine doing the learning!
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Concern: 
Problems with model interpretability 


* No discrete categories or words, everything is a continuous vector. We'd like to have symbolic features like NP, VP, etc. and see why their combination makes sense.
  * True, but most of language is fuzzy and many words have soft relationships to each other. Also, many NLP features are already not human-understandable (e.g., concatenations/combinations of different features).
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Concern: non-convex optimization 


* Can initialize system with convex learner
  * Convex SVM
  * Fixed feature space
* Then optimize non-convex variant (add and tune learned features), can’t be worse than convex learner
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Advantages 
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Despite a small community in the intersection of deep 
learning and NLP, already many state of the art results 
on a variety of language tasks 


Often very simple matrix derivatives (backprop) for 
training and matrix multiplications for testing > fast 
implementation 


Fast inference and well suited for multi-core CPUs/GPUs 
and parallelization across machines 


Learning Multiple Levels of 
Abstraction 


* The big payoff of deep learning is to allow learning higher levels of abstraction
* Higher-level abstractions disentangle the factors of … transfer
* More abstract representations → Successful transfer (domains, languages)
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